Abstract

Sentiment analysis predicts the presence of positive or negative emotions in a text document. In this 
paper we consider higher dimensional extensions of the sentiment concept, which represent a richer set 
of human emotions. Our approach goes beyond previous work in that our model contains a continuous 
manifold rather than a finite set of human emotions. We investigate the resulting model, compare it 
to psychological observations, and explore its predictive capabilities. Besides obtaining significant
improvements over a baseline without a manifold, we are also able to visualize different notions of positive
sentiment in different domains.

1 Introduction 

Sentiment analysis predicts the presence of a positive or negative emotion y in a text document x. Despite 
its successes in industry, sentiment analysis is limited as it flattens the structure of human emotions into a
single dimension. "Negative" emotions such as depressed, sad, and worried are mapped to the negative part 
of the real line. "Positive" emotions such as happy, excited, and hopeful are mapped to the positive part of 
the real line. Other emotions like curious, thoughtful, and tired are mapped to scalars near zero or are otherwise
ignored. The resulting one dimensional line loses much of the complex structure of human emotions. 

An alternative that has attracted a few researchers in recent years is to construct a finite collection of 
emotions and fit a predictive model for each emotion, $\{p(y_i \mid x), i = 1, \ldots, C\}$. A multi-label variation that
allows a document to reflect more than a single emotion uses a single model $p(y \mid x)$ where $y \in \{0, 1\}^C$ is
a binary vector corresponding to presence or absence of emotions. In contrast to sentiment analysis, this 
approach models the higher order structure of human emotions. 

There are several significant difficulties with the above approach. First, it is hard to capture a complex
statistical relationship between a large number of binary variables (representing emotions) and a high
dimensional vector (representing the document). It is also hard to imagine a reliable procedure for compiling
a finite list of all possible human emotions. Finally, it is not clear how to use documents expressing a certain 
emotion, for example tired, in fitting a model for predicting a similar mood, for example sleepy. Using 
labeled documents only in fitting models predicting their denoted labels ignores the relationship among 
emotions and is problematic for emotions without many annotated documents. 
We propose in this paper a different approach for modeling the human emotions that are expressed in
text documents. Our approach is motivated by two observations: (a) human emotions are arranged on a low 
dimensional manifold, and (b) it is easier to construct statistical models for low dimensional continuous data 
than for high dimensional discrete data. 

Specifically, we consider a joint distribution over three random objects X, Y, Z where X is a document, 
Y is a categorical variable representing the emotion reflected in X, and Z is the corresponding location
on the manifold of emotions. We posit the statistical relationship $X \to Z \to Y$, implying that Y is
conditionally independent of X given Z. In other words, the manifold of emotions is a sufficient statistic 
for determining the emotional content of documents. While X, Y are high dimensional and discrete, Z is 
low dimensional and continuous. 

2 Related Work 

Studying emotions and their relations is one of the major goals of the psychology community. Important
papers studying the low dimensional structure of emotions are [18], [16], [17], [19]. In the context of document
analysis, [12] surveys progress in sentiment analysis over the recent decade.

Some recent work on mood classification includes [4], which used linguistic features to detect emotions in
internet chat, and [14], which classified data using the model suggested by [18]. ISl used blog posts to
classify moods with standard machine learning techniques, while Q exploit a mood hierarchy to improve
classification results.

[9] classified time stamped documents in order to show the changes in public moods over time. [11] used
a similar approach to compare Twitter sentiment and Gallup polls. Q visualize the public moods found in
Twitter across time.

3 The Statistical Model 

Several studies in the psychology literature analyzed human survey data to conclude that human emotions 
have a low dimensional structure. The most striking factor conveys a concept similar to positive-negative 
sentiment. Another prominent factor is the engagement level, which includes on one end emotions such as 
quiet and still, and on the other end emotions such as aroused and surprised. While all possible combinations 
of these two factors lead to possible human emotions, some positive correlation exists. See [18], [17], [16] and
Figure 1 for more information on these and additional psychological factors. These studies motivate our
approach of modeling emotions or moods (we use the two terms interchangeably in this paper) on a low 
dimensional continuous space. 

We denote the document, typically in a bag of words or n-gram representation, as X, its mood or emotion
content as a multiclass label variable $Y \in \{1, \ldots, C\}$, and the mood manifold coordinates (in the ambient
space) as $Z \in \mathbb{R}^l$. Labeled training data typically consists of pairs $(x^{(i)}, y^{(i)})$ where $y^{(i)}$ is the mood that
is expressed in the document $x^{(i)}$. In our case of crawled blog entries from livejournal.com, the
authors annotated the entries with their emotions through a rich set of emoticons (see Section 4 for more
information).
Figure 1: The two-dimensional structure of emotions from [18]. We can interpret the top-left to bottom-right
axis as expressing sentiment and the top-right to bottom-left axis as expressing engagement.

We make four modeling assumptions: 

1. $X \to Z \to Y$

2. $\{Z \mid Y = y\} \sim N(\mu_y, \Sigma_y)$    (1)

3. $\{Z \mid X = x\} \sim N(\theta^\top x, \Sigma_x)$

4. The distances between the vectors in $\{E(Z \mid Y = y), y \in C\}$ are similar to the corresponding
distances in $\{E(X \mid Y = y), y \in C\}$.

The first assumption is consistent with the psychological survey studies: the continuous mood representation
Z is the internal emotion while the emotion label Y is simply a discretization of the continuous Z.
The second assumption implies that the distribution over the manifold of emotions given a specific emotion
is a multivariate Gaussian distribution. The third assumption implies a (multi-response) linear regression
relationship between Z and X. The fourth assumption states that the spatial proximities of the mood
centroids in the Z space are similar to the spatial proximities between the mood centroids in the bag of words or
n-gram space.

The assumptions above may be modified if needed. For example, the Gaussian distribution for $Z \mid Y$
may be replaced with a mixture of Gaussians. The linear regression $X \to Z$ may be replaced with an
alternative non-linear regression. We decided on the above model as it is intuitive and simple, it follows 
classical models, it leads to convenient computational schemes, and it works well in practice. 
3.1 Fitting the Model Parameters and Applying it

Motivated by Assumption 4, the parameters $\mu_y = E(Z \mid Y = y), y \in C$ are determined by running
multidimensional scaling (MDS) or Kernel PCA on the empirical versions of $\{E(X \mid Y = y), y \in C\}$ (replacing
each expectation with a train set average).
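
The centroid embedding step can be sketched as follows. This is a minimal illustration that assumes classical (Torgerson) MDS on Euclidean distances between class centroids; the paper does not say which MDS variant or kernel was used, and the function names and toy data are hypothetical.

```python
import numpy as np

def class_centroids(X, y, n_classes):
    """Empirical E[X | Y = c]: the average feature vector of each class."""
    return np.vstack([X[y == c].mean(axis=0) for c in range(n_classes)])

def classical_mds(points, n_dims):
    """Embed `points` in n_dims dimensions with classical (Torgerson) MDS."""
    sq_norms = (points ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * points @ points.T
    n = points.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ sq_dists @ centering   # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(gram)          # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:n_dims]
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale

# Toy data standing in for term-frequency vectors and mood labels (made up).
rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(300, 50)).astype(float)
y = rng.integers(0, 5, size=300)

centroids = class_centroids(X, y, n_classes=5)
mu = classical_mds(centroids, n_dims=2)   # one row per mood: an estimate of mu_y
```

With five centroids, an embedding in four dimensions reproduces the pairwise centroid distances exactly; lower-dimensional embeddings keep only the leading directions.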

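The parameter $\theta$ defining the regression $X \to Z$ is fitted by maximizing the conditional likelihood

$$\hat\theta = \arg\max_\theta \sum_i \log p(y^{(i)} \mid x^{(i)}) \qquad (2)$$

$$= \arg\max_\theta \sum_i \log \int_z p(y^{(i)} \mid z)\, p_\theta(z \mid x^{(i)})\, dz$$

$$= \arg\max_\theta \sum_i \log \int_z p(z \mid y^{(i)})\, \frac{p_\theta(z \mid x^{(i)})}{p(z)}\, dz.$$

The last line uses Bayes' rule, $p(y \mid z) = p(z \mid y)\, p(y) / p(z)$, and drops the factor $p(y^{(i)})$, which does not depend on $\theta$.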

The covariance matrices of the Gaussians $Z \mid Y = y$, $y = 1, \ldots, C$ may be estimated by computing the
empirical variance of $z$ values simulated from $p_{\hat\theta}(Z \mid X^{(i)})$ for all $i$ such that $Y^{(i)} = y$. Alternatively, the
samples may be replaced with the most likely values

$$\left\{ \arg\max_z p_{\hat\theta}(Z = z \mid X^{(i)}) : i = 1, \ldots, n \right\}.$$



Given a new test document x, we can predict the most likely emotion with

$$\hat y = \arg\max_y \int p(y, z \mid x)\, dz = \arg\max_y \int p(y \mid z)\, p_{\hat\theta}(z \mid x)\, dz. \qquad (3)$$

But in many cases, the distribution $p(Z \mid X)$ provides more insightful information than the single most likely
emotion.



3.2 Approximating High Dimensional Integrals 

Some of the equations in the previous section require integrating over $z \in \mathbb{R}^l$, a computationally difficult
task when $l$ is not very low. There are, however, several ways to approximate these integrals in a
computationally efficient way.

The most well-known approximation is probably Markov chain Monte Carlo (MCMC). Another alternative
is the Laplace approximation. A third alternative is based on approximating the Gaussian pdf with
Dirac's delta function, also known as an impulse function, resulting in the approximation

$$\int N(z; \mu, \Sigma)\, g(z)\, dz \approx c(\Sigma) \int \delta(z - \mu)\, g(z)\, dz = c(\Sigma)\, g(\mu). \qquad (4)$$

A similar approximation can also be derived using Laplace's method. The approximation quality
increases as the variance decreases.
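
A small numerical check illustrates how the delta-function approximation behaves as the variance shrinks. The sketch below is one-dimensional, takes the constant c to be 1 (which holds for a normalized Gaussian density; this is an assumption of the sketch, not a statement about the paper's constant), and uses a logistic function as an arbitrary stand-in for g.

```python
import math

def gaussian_pdf(z, mu, sigma):
    """Density of a one-dimensional normal distribution."""
    return math.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def gaussian_expectation(g, mu, sigma, n_steps=20001, width=10.0):
    """Trapezoid-rule estimate of the integral of N(z; mu, sigma^2) g(z) dz."""
    lo, hi = mu - width * sigma, mu + width * sigma
    step = (hi - lo) / (n_steps - 1)
    total = 0.0
    for k in range(n_steps):
        z = lo + k * step
        weight = 0.5 if k in (0, n_steps - 1) else 1.0
        total += weight * gaussian_pdf(z, mu, sigma) * g(z)
    return total * step

def g(z):
    """Arbitrary smooth test function (logistic), chosen for illustration."""
    return 1.0 / (1.0 + math.exp(-z))

mu = 0.7
errors = []
for sigma in (2.0, 1.0, 0.5, 0.1):
    integral = gaussian_expectation(g, mu, sigma)
    errors.append(abs(integral - g(mu)))   # gap between the integral and g(mu)
```

The entries of `errors` shrink as `sigma` decreases, which is the behavior the text describes.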
Applying (4) to (2) we get

$$\hat\theta \approx \arg\max_\theta \sum_i \log \frac{p_\theta(z^{(i)*} \mid x^{(i)})}{p(z^{(i)*})} = \arg\max_\theta \sum_i \log p_\theta(z^{(i)*} \mid x^{(i)}) \qquad (5)$$

where

$$z^{(i)*} = \arg\max_z p(z \mid y^{(i)}) = E(Z \mid y^{(i)}),$$

which is equivalent to a least squares regression.
Applying (4) to (3) yields a classification rule

$$\hat y \approx \arg\max_y p\left(Y = y \mid Z = \arg\max_z p_{\hat\theta}(z \mid x)\right). \qquad (6)$$
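
The two approximations can be put together in a short sketch: fit the regression by least squares with the class centroids as targets, then classify a document by evaluating the class posteriors at its most likely manifold location. The ridge term, the empirical priors, the covariance jitter and the toy data below are assumptions added for illustration; they are not taken from the paper.

```python
import numpy as np

def fit_theta(X, y, mu, ridge=1e-3):
    """Least-squares fit so that theta^T x approximates the centroid mu[y]."""
    targets = mu[y]                                   # z^(i)* = E(Z | y^(i))
    gram = X.T @ X + ridge * np.eye(X.shape[1])       # small ridge for stability (assumed)
    return np.linalg.solve(gram, X.T @ targets)       # shape (n_features, l)

def log_gaussian(z, mean, cov):
    """Log density of a multivariate normal at z."""
    diff = z - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (maha + logdet + len(z) * np.log(2.0 * np.pi))

def predict(x, theta, mu, covs, priors):
    """Evaluate p(y | z) up to a constant at z = theta^T x and take the argmax."""
    z = theta.T @ x                                   # mode of N(theta^T x, Sigma_x)
    scores = [log_gaussian(z, mu[c], covs[c]) + np.log(priors[c])
              for c in range(len(mu))]
    return int(np.argmax(scores))

# Toy problem: three moods with made-up centroids in a 2-D Z space.
rng = np.random.default_rng(0)
n_classes, n_features = 3, 5
mu = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
y = rng.integers(0, n_classes, size=300)
class_means = np.zeros((n_classes, n_features))
class_means[np.arange(n_classes), np.arange(n_classes)] = 3.0
X = class_means[y] + 0.3 * rng.standard_normal((300, n_features))

theta = fit_theta(X, y, mu)
Z = X @ theta                                         # most likely z for each document
covs = [np.cov(Z[y == c], rowvar=False) + 1e-3 * np.eye(2) for c in range(n_classes)]
priors = np.bincount(y, minlength=n_classes) / len(y)
y_hat = np.array([predict(x, theta, mu, covs, priors) for x in X])
accuracy = float((y_hat == y).mean())
```

Here the class covariances are estimated from the most likely z values rather than from simulated samples, which corresponds to the second of the two estimation options described in Section 3.1.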



4 Applications and Experiments 

In this section, we examine some applications of our model and report experimental results.



4.1 Datasets 

We used crawled LiveJournal data to fit the model parameters. LiveJournal is a popular blog service that
offers emotion annotation capabilities to the authors. About 20% of the blog posts feature these optional
annotations in the form of emoticons. The annotations may be chosen from a pre-defined list of possible
emotions, or may be a novel emotion specified by the author. We crawled 465,945 documents featuring the most
popular 100 emotions. Two other datasets that we use in our experiments are movie review data [13] and
restaurant review data.

We used Indri from the Lemur project to extract term frequency features from these three datasets 
while tokenizing and stemming words. As is common in sentiment studies [1], [10], [6], we added new features
representing negated words. For example, the phrase "not good" is represented as a token "not-good" rather 
than as two separate words. This resulted in 31,726 features. 
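
The negation features can be illustrated with a short tokenizer sketch. It only handles the word "not", uses a simple regular expression instead of Indri, and skips stemming; all of these are simplifying assumptions, and the function names are hypothetical.

```python
import re
from collections import Counter

NEGATIONS = {"not"}  # assumed list; the paper only gives "not good" as an example

def negation_tokens(text):
    """Lowercase, split into words, and merge a negation with the next word."""
    words = re.findall(r"[a-z']+", text.lower())
    tokens, i = [], 0
    while i < len(words):
        if words[i] in NEGATIONS and i + 1 < len(words):
            tokens.append(f"{words[i]}-{words[i + 1]}")   # e.g. "not-good"
            i += 2
        else:
            tokens.append(words[i])
            i += 1
    return tokens

def term_frequencies(text):
    """Bag-of-words counts over the negation-aware tokens."""
    return Counter(negation_tokens(text))
```

For example, `negation_tokens("The food was not good")` yields the tokens `the`, `food`, `was`, `not-good`, so "good" in a negated context no longer counts toward the plain "good" feature.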



4.2 Exploring Mood Manifold 

Figure 2 shows the locations of $E(Z \mid Y = y)$ for the most popular 31 moods, in the first two dimensions of
the mood manifold. The choice of two dimensions was made for visualization purposes. In later sections we
consider higher dimensional ambient spaces for the manifold of emotions.
We make the following observations.

1. The horizontal axis expresses a sentiment-like emotion. The left part features emotions such as
happy and cheerful, while the right part features emotions such as sad and depressed. This
is in agreement with Watson's observations (see Figure 1) that identify positive-negative sentiment as
the most prominent factor among human emotions.
Figure 2: Mood centroids $E(Z \mid Y = y)$ on the two most prominent dimensions in emotion space fitted
from blog posts. The horizontal dimension corresponds to negative vs. positive sentiments and the vertical
dimension corresponds to engagement level (compare with Figure 1).

2. The vertical axis expresses the level of engagement. The top part features emotions such as thoughtful
or contemplative, while the bottom part features emotions such as bored. This also agrees with
Watson's psychological model.

3. The right part is spatially focused while the left part is dispersed. In other words, we have a clear one 
dimensional curve starting on the right, and as it moves to the left it spreads out to fill the space. We 
conclude that there is higher diversity among positive emotions than among negative emotions. 

Another way to analyze the model is by examining which words receive high weights for the different
axes. The words with highest weight associated with the horizontal axis are indeed sentiment words:
{depress, sad, confuse, depression, cry, rip, sigh, upset, died, 
not-understand} on the negative side and {excite, yay, awesome, not-wait, happy, 
welcome, laugh, glad, lol, amaze, proud, haha} on the positive side. 

We conclude that there are agreements between our model and standard psychological models. The
sentiment concept emerges as the top factor in both models. The second most prominent factor, the engagement
level, is also a close match. It is remarkable that the same structure of emotions arises from two different
data sources: human surveys and annotated blog posts. Our framework may contribute several additional
markers to psychological models ifTSl [TOll : bored as negative engagement marker, and thoughtful, 
contemplative as positive engagement markers. 
4.3 Emotions on the Manifold



The emotion space represented by Z is stochastically related to the emotion label Y. This relation $P(Z \mid Y)$
may also be used to examine the relationship between different emotions. The examination should be
consistent to some extent with our understanding of emotion, though some discrepancies may reveal interesting
insights.

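Since two emotions are represented by Gaussian distributions on Z, a natural distance measure between
two emotions is the Hellinger distance between the corresponding densities

$$d^2(f, g) = \int \left( \sqrt{f(z)} - \sqrt{g(z)} \right)^2 dz. \qquad (7)$$

Since each density integrates to one, (7) equals $2 - 2\exp(-B(f, g))$ and is therefore equivalent, up to a
monotone transformation, to the Bhattacharyya distance (the negative logarithm of the Bhattacharyya coefficient)

$$B(f, g) = -\log \int \sqrt{f(z)\, g(z)}\, dz,$$

which has the following closed form in the case of two multivariate Gaussians $f = N(\mu_1, \Sigma_1)$ and $g = N(\mu_2, \Sigma_2)$

$$B(f, g) = \frac{1}{8} (\mu_1 - \mu_2)^\top \left( \frac{\Sigma_1 + \Sigma_2}{2} \right)^{-1} (\mu_1 - \mu_2) + \frac{1}{2} \log \frac{\det\left( \frac{\Sigma_1 + \Sigma_2}{2} \right)}{\sqrt{\det \Sigma_1 \, \det \Sigma_2}}.$$

Following common practice, we add a small value to the diagonal of the covariance matrices to ensure
invertibility.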
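
The closed form for two Gaussians is straightforward to compute. The sketch below follows the standard Bhattacharyya formula for multivariate normal densities; the size of the diagonal jitter is an assumed value, since the paper does not state the one it used.

```python
import numpy as np

def bhattacharyya_distance(mu1, cov1, mu2, cov2, jitter=1e-6):
    """-log of the integral of sqrt(f g) for two multivariate Gaussians."""
    d = len(mu1)
    cov1 = cov1 + jitter * np.eye(d)      # small diagonal term to ensure invertibility
    cov2 = cov2 + jitter * np.eye(d)
    cov = 0.5 * (cov1 + cov2)
    diff = mu1 - mu2
    maha = diff @ np.linalg.solve(cov, diff)
    _, logdet = np.linalg.slogdet(cov)
    _, logdet1 = np.linalg.slogdet(cov1)
    _, logdet2 = np.linalg.slogdet(cov2)
    return 0.125 * maha + 0.5 * (logdet - 0.5 * (logdet1 + logdet2))

def squared_hellinger(mu1, cov1, mu2, cov2):
    """Integral of (sqrt f - sqrt g)^2, which equals 2 - 2 exp(-B)."""
    return 2.0 - 2.0 * np.exp(-bhattacharyya_distance(mu1, cov1, mu2, cov2))

# Two unit-covariance Gaussians whose means are one unit apart.
b = bhattacharyya_distance(np.array([0.0, 0.0]), np.eye(2),
                           np.array([1.0, 0.0]), np.eye(2))
```

With equal covariances the log-determinant term vanishes and only the Mahalanobis term remains, so `b` is close to 1/8 in this example.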

Figure 3 shows the mood dendrogram obtained by hierarchical clustering of the top 31 emotions using the
Bhattacharyya distance (complete linkage clustering). The bottom part of the dendrogram was omitted due
to lack of space. The clustering is in agreement with our intuition. For example,

1. aggravated and annoyed are in the same tight cluster and close to confused,

2. sad and depressed are in the same tight cluster,

3. bouncy, cheerful, and happy are in the same tight cluster, which is close to accomplished
and excited, and

4. bored, sleepy, and tired are in the same tight cluster.
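
Complete linkage clustering on a precomputed distance matrix can be sketched in a few lines. The implementation below is a naive version written for clarity rather than speed, and the mood coordinates are made-up values, with Euclidean distances standing in for the distances between mood densities.

```python
import numpy as np

def complete_linkage(dist, n_clusters):
    """Agglomerative clustering with complete linkage on a distance matrix."""
    clusters = [[i] for i in range(dist.shape[0])]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: distance between the two farthest members.
                d = max(dist[i, j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]   # merge the closest pair
        del clusters[b]
    return clusters

# Hypothetical 2-D mood locations (illustrative only, not the fitted centroids).
names = ["sad", "depressed", "happy", "cheerful", "bored"]
points = np.array([[2.0, 0.0], [2.3, -0.1], [-2.0, 0.1], [-2.1, 0.3], [0.5, -2.0]])
dist = np.sqrt(((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1))

clusters = complete_linkage(dist, n_clusters=3)
named = [[names[i] for i in members] for members in clusters]
```

Stopping the merges at a chosen number of clusters is one way to obtain aggregated groups of moods from the full hierarchy.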

The hierarchical clustering is useful in many ways. When the original emotions hierarchy is too fine
(there are over 100 emotions in our data) we may choose to aggregate similar emotions into "super
emotions". If our particular situation requires paying attention to one or two "types" of emotions we can use
a particular mood cluster to reflect the desired feature. For example, when analyzing product reviews we may
want to partition the emotions into two clusters: positive and negative. When analyzing the effect of a new
advertisement campaign we may be interested in a clustering based on engagement: excited and energetic
vs. bored. Other situations may call for other clusters of emotions.

Figure 2 shows the spatial arrangements of $E(Z \mid Y = y)$ in the Z space. A more careful analysis should
also take into consideration the covariance matrices of $P(Z \mid Y = y)$, rather than just the expectation vectors.
Figure 4 shows the Voronoi tessellation corresponding to

$$f(z) = \arg\max_{y = 1, \ldots, C} p(Z = z \mid Y = y).$$

For space and clarity purposes we use 15 "super-emotions" obtained by clustering the original set of
emotions as described above, instead of the entire set of 31 or 100 top emotions.
We observe that:



[Figure: cluster labels recovered from the figure text: sad; depressed; cold; bored, sleepy, tired; awake, calm, content, curious, hopeful; amused, artistic, busy, chipper, creative; bouncy, cheerful, happy; excited; accomplished; contemplative; blah, blank, exhausted; determined, thoughtful; anxious; confused; aggravated, annoyed.]



1. As in Figure 2, the horizontal axis corresponds to positive-negative emotion and the vertical axis
corresponds to engagement: thoughtful and contemplative vs. bored and tired.

2. The depressed region is spread significantly on the bottom-right side, and is neighboring the 
bored, sleepy, tired region and the sad region. 

3. The region corresponding to the bouncy, cheerful, happy emotions neighbors the accomplished 
region and the excited region. 

A similar tessellation of a higher dimensional Z space should provide finer relationships between human
emotions.

4.4 Classifying Emotions 

An important application, analogous to sentiment analysis, is emotion classification. In other words, given 
a document x predict the emotion that is expressed in the text. 

As mentioned in the introduction, it is possible to do that by constructing separate $p(y_i \mid x)$ models for
every emotion (one vs. all approach). The one vs. all approach is not entirely satisfactory as it ignores
the relationships between similar and contradictory moods. Why should we not use documents labeled as
sleepy when we fit a model for predicting tired? On the other hand, it is not clear how to count these
documents since sleepy and tired are not identical emotions.

Our framework accounts for the relationship between similar and contradictory emotions automatically
as it assumes a hidden continuous representation, where $P(Z \mid Y = y)$ reflects a non-trivial relationship
between the emotions. Our earlier attempts to construct a manual relationship between emotions based on
domain knowledge did not perform well. Our current approach is data driven and it outperforms
the one vs. all approach, as we show below. Our one vs. all baseline is a regularized logistic regression
operating in the original bag of words feature space, one of the strongest text classification baselines.

Since $P(Z \mid Y = y)$ are Gaussian, the resulting Bayes classifier, which minimizes the classification
risk, is the well-known quadratic discriminant analysis (assuming $\mathrm{Var}(Z \mid Y = y)$ depends on $y$), or the
well-known linear discriminant analysis (assuming that $\mathrm{Var}(Z \mid Y = y)$ does not depend on $y$).

We considered three different models for the covariance matrices: full covariance, diagonal covariance,
and a linear combination of full covariance and spherical covariance.

In each case we used a C dimensional ambient space (C equals the number of emotions) and the
approximation (6).
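
The three covariance choices can be illustrated with a short sketch. The exact form of the linear combination is not recoverable from the text, so the convex combination below, with a hypothetical weight `lam` and a spherical component scaled to the average variance, is an assumption made for illustration.

```python
import numpy as np

def covariance_models(Z, lam=0.5):
    """Return full, diagonal, and blended (full + spherical) covariance estimates.

    Z has one row per sample. `lam` is a hypothetical mixing weight in [0, 1].
    """
    full = np.cov(Z, rowvar=False)
    diagonal = np.diag(np.diag(full))
    d = full.shape[0]
    spherical = (np.trace(full) / d) * np.eye(d)     # same total variance as `full`
    blended = (1.0 - lam) * full + lam * spherical
    return full, diagonal, blended

# With fewer samples than dimensions the full estimate is singular,
# while the blended estimate stays positive definite.
rng = np.random.default_rng(1)
Z = rng.standard_normal((3, 6))
full, diagonal, blended = covariance_models(Z, lam=0.3)
```

A blend of this kind matters most for moods with few documents, where the full covariance estimate in a C dimensional space is poorly conditioned.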

Classification experiments (Figures 5 and 6) are performed on the LiveJournal data with the most popular 32
moods, 100 moods, and the 15 clusters from Figure 3. Half of the data is used for training and the
other half for testing; t-tests are performed on 10 random trials to determine statistical significance. Low
accuracies are expected in this task because there are many similar emotions that significantly overlap each
other. We also designed three sample binary classification tasks obtained by partitioning the set of moods
into two clusters (positive vs. negative sentiment, engagement vs. boredom, and anger vs. calm).
Figure 5: F1 and accuracy over the test set in the sentiment task (left): {cheerful, happy, amused} vs. {sad, annoyed,
exhausted}, in detecting engagement level (middle): {tired, bored, sleepy} vs. {determined, thoughtful},
and in detecting anger (right): {annoyed, aggravated} vs. {calm, content}. Bold text represents statistically
significant (t-test) improvements over the regularized one vs. all logistic regression baseline in the original
feature space.
Figure 6: Macro F1 score and accuracy over the test set in multiclass emotion classification. Left panel
shows classification over top 32 moods. Middle panel shows classification of top 100 moods. Right panel
shows classification of the 15 clusters from Figure 3. Bold text represents statistically significant (t-test)
improvement over the regularized one vs. all logistic regression in the original feature space.
Figure 5 and Figure 6 compare classification results using the emotion manifold model (LDA/QDA
with different covariance matrix models) and a regularized logistic regression baseline on the original bag
of words features. Most of the experimental results show that the emotion manifold model results in a
statistically significant classification improvement. The improvements are especially noticeable in the
F1 measures; mood categories with less training data benefit more from the manifold model, and these minority
classes contribute more to macro F1 than to the accuracy measure.
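
A small worked example shows why gains on rare moods show up more strongly in macro F1 than in accuracy. The labels and predictions below are made up for illustration and are not taken from the experiments.

```python
def f1_per_class(y_true, y_pred, label):
    """F1 score of a single class from true and predicted label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores: every class counts equally."""
    labels = sorted(set(y_true))
    return sum(f1_per_class(y_true, y_pred, c) for c in labels) / len(labels)

def accuracy(y_true, y_pred):
    """Fraction of correct predictions: frequent classes dominate."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# 90 documents of a frequent mood and 10 of a rare one (made-up data).
y_true = ["happy"] * 90 + ["cold"] * 10
always_happy = ["happy"] * 100                        # ignores the rare mood
finds_cold = ["happy"] * 85 + ["cold"] * 5 + ["cold"] * 10

accuracy_gain = accuracy(y_true, finds_cold) - accuracy(y_true, always_happy)
macro_f1_gain = macro_f1(y_true, finds_cold) - macro_f1(y_true, always_happy)
```

In this toy case the second classifier recovers the rare mood at the cost of a few errors on the frequent one, which moves accuracy only slightly but raises macro F1 substantially.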



4.5 Sentiments and the Emotion Manifold 

The concept of positive-negative sentiment fits naturally within our framework as it is the first factor in the 
continuous Z space. Nevertheless, it is unlikely that all sentiment analysis concepts will align perfectly 
with this dimension. Indeed, it is likely that different sentiment concepts, for example movie reviews and
restaurant reviews, do not represent identical concepts.

We model a sentiment concept as a smooth one dimensional curve within the continuous Z space. As
we traverse the curve, we encounter documents corresponding to negative sentiments, changing smoothly
into emotions corresponding to positive sentiments. We complement the stochastic embedding with
a smooth probabilistic mapping $\pi(R \mid Z)$ into the sentiment scale. The prediction rule becomes

$$\hat r = \arg\max_r \int p(Z = z \mid X)\, \pi(R = r \mid Z = z)\, dz$$

and its approximated version is

$$\hat r = \arg\max_r \pi\left( R = r \mid Z = \arg\max_z P(Z = z \mid X) \right).$$

Figure 7 (top) shows the smooth curves corresponding to $E[\pi(R = r \mid Z)]$ for movie reviews and
restaurant reviews. Both curves progress from the right (low sentiment) to the left (high sentiment). But the two
curves show a clear distinction: the movie review sentiment concept is in the top part while the restaurant
review sentiment concept is in the bottom part. Obviously, the two sentiment concepts are different: movie
reviews are evidently more thoughtful and creative than restaurant reviews.

Figure 7 (bottom left and right) shows the test $L_1$ prediction error of our method and a baseline
(regularized linear regression trained on the original bag of words features) as a function of the train set size. The
manifold regression performs better than regression on the original bag of words features when the train set
is small. As the train set increases, the regression on the full bag of words features outperforms the manifold
model (for n = 4000, the $L_1$ difference between the two models for movie reviews on a 1-10 scale is 0.198).

We make the following observations. 

1. Sentiment concepts in different contexts are not interchangeable. They correspond to different curves
in the manifold of emotions, as is nicely demonstrated by Figure 7 (top).

2. The model parameters defining $X \to Z \to Y$ are fitted using blog entries labeled with author
emotions. The regression model $\pi(R \mid Z)$ is fitted using separate sentiment training data. Since Z
is lower dimensional than the original bag of words, we can expect our approach to be more accurate
when the labeled sentiment data is scarce. For the same reason, it is more feasible to train a complex
non-linear model on the manifold of emotions than on the bag of words representation.

3. Concepts such as movie or restaurant ratings are not solely captured by the manifold of emotions. This
is not surprising as many review sentences make non-emotional high-level arguments that are not
captured by the continuous Z space. Consider, for example, the following sentence, taken from a
positive review: "Crumb is a rare and powerful documentary that completely absorbs the viewer and
leaves an impression so blindingly clear that the afterimage cannot be blinked away even when the
theater is far behind." This explains the improved performance of the linear regression baseline (using
the original bag of words features) when there is sufficient training data.

Figure 7: Top: Projected centroids of each review score (higher is better) of movie reviews and restaurant
reviews on the mood manifold. Both start from the right side (negative sentiment in the mood manifold)
and continue to the left side (positive sentiment) with two different patterns. Movie reviews are
evidently more thoughtful and creative than restaurant reviews. Bottom left and right: $L_1$ test prediction error
on movie reviews (left) and restaurant reviews (right) as a function of the sentiment train set size. Prediction
using the manifold of emotions outperforms the baseline (linear regression) for smaller training set sizes.

5 Discussion 

In this paper, we introduced a continuous representation for human emotions Z and constructed a statistical 
model connecting it to documents X and to a discrete set of emotions Y . Our fitted model bears close 
similarities to models developed within the psychological literature, based on human survey data. 

Among the many applications of our model are: discovering the complex relationships between
emotions, clustering of emotions, improved classification of emotions, and sentiment prediction.

Several attempts were recently made at inferring insights from social media or news data through
sentiment prediction. Examples include tracking public opinion [11], estimating political sentiment [15], and
correlating sentiment with the stock market [3]. It is likely that a more comprehensive and multivariate view
of emotions will help make progress on these important and challenging tasks.